Chiara Baldoni 285441, Sofia Bruni 285231, Francesca Romana Sanna 282491, Mattia Sebastiani 288071
The dataset under analysis is the “Brazilian Houses” dataset, which contains information about houses for rent in different Brazilian cities. This project aims to predict rent prices in major Brazilian cities, providing crucial insights for new companies entering the real estate market. By identifying key factors influencing rental prices and segmenting the market geographically, we equip newcomers with the knowledge needed to make informed investment decisions and optimize their pricing strategies.
We begin our analysis by loading the dataset and performing data cleaning and preparation steps to ensure the data is ready for the subsequent analysis and modeling. This includes handling missing values, duplicates, and outliers, as well as converting data types and renaming columns for clarity.
The dataset consists of 10962 observations with 13 variables, which provide information about the houses. In particular, there are 10 numerical variables and 3 categorical variables.
## [1] "Null values before preprocessing"
## city area rooms bathroom parking.spaces
## 0 0 0 0 0
## floor animal furniture hoa rent
## 2461 0 0 0 0
## property_tax fire_insurance
## 0 0
## There are 0 null values after preprocessing
## There are 0 duplicates rows before preprocessing
We found 2461 missing values in the floor variable. We decided to interpret these missing values as houses having only the ground floor, as the ground floor is usually not counted in the floor number. Therefore, we replaced them with 0.
Additionally, we found no duplicate rows.
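The two checks described above can be sketched in base R as follows. This is a minimal illustration on a made-up three-row data frame (`toy` and its values are hypothetical, not taken from the dataset); it only shows the pattern of counting missing values, replacing missing floors with 0, and counting duplicate rows.

```r
# Hypothetical toy data: two houses have no recorded floor
toy <- data.frame(
  city  = c("Campinas", "Campinas", "Porto Alegre"),
  floor = c(NA, 3, NA),
  rent  = c(1200, 2500, 900)
)

# Missing values per column before preprocessing
na_before <- colSums(is.na(toy))

# Interpret a missing floor as a ground-floor house
toy$floor[is.na(toy$floor)] <- 0

# Missing values and fully duplicated rows after preprocessing
na_after <- sum(is.na(toy))
n_dup    <- sum(duplicated(toy))
```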
After handling null values, here is what the cleaned dataset looks like:
## city area rooms bathroom
## Belo Horizonte:1209 Min. : 11.0 Min. : 1.00 Min. :1.000
## Campinas : 823 1st Qu.: 59.0 1st Qu.: 2.00 1st Qu.:1.000
## Porto Alegre :1154 Median : 95.0 Median : 3.00 Median :1.000
## Rio de Janeiro:1431 Mean : 152.5 Mean : 2.54 Mean :1.283
## São Paulo :5712 3rd Qu.: 190.0 3rd Qu.: 3.00 3rd Qu.:1.000
## Max. :46335.0 Max. :13.00 Max. :9.000
## parking.spaces floor animal furniture
## Min. :0.00 Min. : 0.000 acept :8073 furnished :2515
## 1st Qu.:1.00 1st Qu.: 1.000 not acept:2256 not furnished:7814
## Median :2.00 Median : 3.000
## Mean :1.33 Mean : 5.102
## 3rd Qu.:2.00 3rd Qu.: 8.000
## Max. :8.00 Max. :301.000
## hoa rent property_tax fire_insurance
## Min. : 0 Min. : 450 Min. : 0.0 Min. : 3.00
## 1st Qu.: 180 1st Qu.: 1599 1st Qu.: 41.0 1st Qu.: 21.00
## Median : 571 Median : 2750 Median : 130.0 Median : 37.00
## Mean : 1092 Mean : 3967 Mean : 377.1 Mean : 54.28
## 3rd Qu.: 1289 3rd Qu.: 5000 3rd Qu.: 390.0 3rd Qu.: 70.00
## Max. :1117000 Max. :45000 Max. :313700.0 Max. :677.00
By looking at the summary, we noticed suspiciously high maximum values in the features of area, hoa (monthly homeowner association tax), property tax, and fire insurance, suggesting that some houses are not representative of the general market conditions. For each selected feature, we created boxplots and histogram density plots to visualize their distributions.
For space reasons, we only show the plots for area. The plots for the other variables are similar to those shown below.
The plots revealed, as suspected, right-skewed distributions, indicating the presence of high-value outliers. To better understand the data without the distortions caused by these extreme values, we considered generating log-scaled versions of the plots to normalize the distribution. However, this approach might have understated the magnitude of the outliers and, in turn, destabilized our results and reduced the accuracy of our models. Therefore, we decided to remove the outliers.
## Number of outliers detected: 268
## Most common rent value among the outliers: 15000
## Frequency of most common rent value among the outliers: 231
In total, 268 houses with extreme values were excluded from our dataset. We used the z-score method with a threshold of 3 to identify and remove outliers, ensuring that we addressed only the most extreme values without eliminating significant portions of data. In doing so, we noticed that most of the removed houses (231 of 268) shared a rent value of 15,000, suggesting potential data entry errors. To confirm the effectiveness of our chosen threshold, we used Q-Q plots, which assess whether the data distribution follows the expected normal distribution. The Q-Q plots showed that deviations from normality occurred primarily at data points with z-scores beyond our threshold. We also tested different thresholds. Lowering the threshold to 1 or 2 would have unnecessarily excluded a large portion of viable data, while a higher threshold like 4 would not have filtered out enough potential errors. Thus, a threshold of 3 balanced the need to remove outliers against the need to retain data.
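As an illustration of the z-score rule described above, the following base R sketch drops every row whose absolute z-score exceeds the threshold in any of the selected columns. The helper name `remove_outliers_z` and the toy data are hypothetical; the report does not show the original code, so this is one plausible reading of the procedure, not the authors' implementation.

```r
# Hypothetical helper: remove rows with |z| > threshold in any chosen column
remove_outliers_z <- function(df, cols, threshold = 3) {
  # scale() centers each column and divides by its sample standard deviation
  z <- abs(scale(df[, cols, drop = FALSE]))
  keep <- rowSums(z > threshold) == 0
  list(clean = df[keep, , drop = FALSE], n_removed = sum(!keep))
}

# Toy data: 50 ordinary houses plus one implausibly large area
toy <- data.frame(
  area = c(seq(50, 150, length.out = 50), 46335),
  hoa  = c(seq(300, 700, length.out = 50), 500)
)

res <- remove_outliers_z(toy, cols = c("area", "hoa"), threshold = 3)
res$n_removed
```

Because the mean and standard deviation are themselves pulled by extreme values, a single z-score pass can mask smaller outliers; that is one reason to compare several thresholds, as done above.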
After cleaning the data, we proceed with Exploratory Data Analysis (EDA). We start by analyzing the distribution of categorical variables, followed by a correlation analysis of numerical variables and simple regression models to identify the most significant predictors of rent prices. This step is crucial for understanding the relationships between variables, identifying patterns, and selecting the most impactful features for building robust predictive models in the next phase of our analysis.
We assess whether the categorical variables of furniture, animal policy, and city have any impact on rent prices. For each feature, we create violin plots complemented by pie charts and histograms. Violin plots are particularly effective for this type of analysis because they merge the attributes of box plots with density plots, illustrating the distribution of rent prices and visually representing the probability density at various values. This approach allows us to see not only the median rent prices and interquartile ranges, but also the overall distribution shapes.
The violin plot for furniture revealed the significant impact of furnishing status on rent prices, with furnished homes commanding higher rents, evident from their broader distribution at the upper end. This observation aligns with the pie chart, which shows a larger market share of unfurnished homes, as they are more affordable. The violin plot for cities revealed that São Paulo has higher rents compared to other cities, marked by a higher median and a broader distribution, suggesting a more expensive and varied rental market. The histogram supports this, showing that São Paulo has a wider range of rent prices due to the presence of both low-end and luxury properties. In contrast, the allowance of animals does not have a significant impact on rent prices, as indicated by the similar distribution shapes in both categories.
We perform an Analysis of Variance (ANOVA) test to confirm whether the city variable significantly influences rent prices. The ANOVA test compares the means of rent prices across different cities to determine if the differences are statistically significant.
## Df Sum Sq Mean Sq F value Pr(>F)
## city 4 6.881e+09 1.720e+09 218 <2e-16 ***
## Residuals 10056 7.935e+10 7.891e+06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results showed a p-value less than 0.05, indicating that the city variable significantly influences rent prices and confirming our earlier observation.
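As a quick consistency check, the F statistic and p-value can be recomputed from the sums of squares and degrees of freedom printed in the ANOVA table above. The sketch below uses only those reported numbers; the variable names are illustrative.

```r
# Values copied from the ANOVA table
ss_city  <- 6.881e9;  df_city  <- 4
ss_resid <- 7.935e10; df_resid <- 10056

# F = mean square between cities / mean square of residuals
f_value <- (ss_city / df_city) / (ss_resid / df_resid)

# Upper-tail probability of the F distribution
p_value <- pf(f_value, df_city, df_resid, lower.tail = FALSE)

round(f_value)
```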
We further explore the impact of location on rent prices by creating density plots of rent distributions by city with dashed average rent lines for each city, and density plots of rent prices by city and furniture status. In doing so, we highlight the variations in rent prices across locations and compare the distributions of rent prices for furnished and unfurnished homes in different cities.
The plot on the left revealed significant variations in rent prices across locations, with São Paulo having the highest average rent, as indicated by the dashed lines, and a broader distribution of prices compared to other cities, as indicated by the wider density plot. The plot on the right showed that furnished homes generally have higher rents than unfurnished homes across all cities. Again, São Paulo has the highest rent prices for both furnished and unfurnished homes, followed by Rio de Janeiro, Porto Alegre, Campinas, and Belo Horizonte. This suggests that the city has a more significant impact on rent prices than furniture status, as the rent differences between cities are more pronounced than those between furnished and unfurnished homes within the same city. This aligns with our earlier findings and confirms the importance of location.
We measure the correlation between numerical variables in our dataset to assess their relevance in predicting rent prices. We create a correlation matrix and a heatmap to visualize their linear relationships and identify significant predictors.
| Variable | Correlation with rent |
|---|---|
| fire_insurance | 0.9863253 |
| area | 0.6500048 |
| property_tax | 0.5697895 |
| rooms | 0.5327854 |
| parking.spaces | 0.4520477 |
| hoa | 0.4127875 |
| bathroom | 0.1649667 |
| floor | 0.0812383 |
We sorted the correlation values in descending order (excluding the target variable) to identify the top four variables most correlated with rent. The heatmap revealed that fire insurance has a near-perfect correlation with rent (0.99), followed by area (0.65), property tax (0.57), and rooms (0.53). These variables are expected to have a significant impact on rent prices and will serve as the basis for our predictive model in the next phase of our analysis. While it is expected that rooms and area are positively correlated with rent, as larger properties tend to have more rooms and higher rents, the remarkably high correlation with fire insurance is surprising and warrants further investigation. This correlation might indicate that the quality and location of the property significantly influence rent prices, making fire insurance a crucial predictor of rent amounts. Similarly, the correlation with property tax is intriguing, as higher property taxes might indicate more luxurious properties with higher rents.
In addition, apart from rent, area shows strong correlations with fire insurance, property tax, and rooms. Fire insurance also shows a strong correlation with property tax and rooms. However, these findings do not raise multicollinearity issues, as the only two highly correlated variables that surpass a 0.8 correlation threshold are fire insurance and our target. Therefore, there is no need to remove any variables from the dataset.
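The ranking step described above (correlate every numeric variable with rent, drop the target, sort in descending order) can be sketched in base R. The six-row `toy` data frame is invented for illustration and does not reproduce the reported coefficients.

```r
# Hypothetical numeric data; only the pattern matters, not the values
toy <- data.frame(
  rent           = c(1000, 1500, 2000, 2500, 3000, 4000),
  fire_insurance = c(14, 20, 27, 33, 40, 53),
  area           = c(40, 90, 60, 120, 80, 150),
  floor          = c(3, 1, 8, 2, 5, 4)
)

# Pearson correlation matrix of all numeric variables
cors <- cor(toy)

# Correlations with the target, excluding the target itself, sorted
predictors <- setdiff(rownames(cors), "rent")
rent_cor <- sort(cors[predictors, "rent"], decreasing = TRUE)
rent_cor
```

The same `cors` matrix can then be scanned for predictor pairs above a chosen threshold such as 0.8 to screen for multicollinearity.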
We proceed to investigate the relationships between rent prices and the top four correlated variables we identified, as the other variables in the matrix are not as significant, using simple regression models. We create scatter plots with regression lines to display these linear relationships, each with a 95% confidence band. This level is standard in statistical analysis: under repeated sampling, bands constructed this way would contain the true regression line 95% of the time.
The regression lines indicated positive correlations for all variables. Fire insurance showed the strongest and most precise relationship with rent, with closely clustered points and a narrow confidence interval, suggesting highly reliable predictions even when working as a single predictor. Area and property tax also influence rent but exhibit more variability, making their standalone predictions less precise compared to fire insurance. The number of rooms showed significant variability, as indicated by the broad confidence interval, suggesting other factors may also play a substantial role in determining rent prices and should be investigated further.
Additionally, we conduct a multivariate analysis of the variables that most impact rent prices to understand their combined effects. We create boxplots to visualize the distribution of fire insurance, property tax, and area across different cities. We do not include the animal policy and number of rooms in this analysis as they showed less significant relationships with rent prices compared to the other variables in the previous phases. We also do not include the furniture status as it relates more to the rent prices themselves than to the other variables.
The boxplots showed that São Paulo has the highest fire insurance, property tax, and area values, indicating that it is the most expensive city in terms of rent prices. This aligns with our previous findings that São Paulo has the highest average rent prices compared to other cities. Furthermore, we can observe that the distributions of fire insurance, property tax, and area vary significantly across cities, suggesting that location plays an important role in determining these variables and, consequently, rent prices. For instance, cities like Belo Horizonte and Campinas generally showed lower values in these variables compared to São Paulo, reflecting their lower rent prices.
Before proceeding with the full rent predictive model aimed at providing the best result, we test some lower-dimensional models to investigate our correlation matrix and simple regression model findings that fire insurance and area are the most significant predictors of rent prices.
The lower-dimensional models we test are:
1. The top 2 AIC model, which includes only the two variables most correlated with rent, fire insurance and area.
2. The no fire insurance AIC model, which includes all variables except fire insurance.
3. The complete AIC model, which includes all variables.
4. The feature engineering AIC model, which includes interaction terms between the variables. In particular, we include interaction terms between area and rooms, as they showed the second-highest correlation in the matrix, and between HOA (monthly homeowners association tax) and property tax to capture the impact of these combined fees on rent prices.
Since AIC and BIC perform similarly, we chose AIC to select the best model: with relatively few predictors, we do not need the stricter penalty that BIC applies. Unlike BIC, AIC avoids over-penalizing predictors, ensuring the inclusion of critical features like fire insurance and providing more flexibility. We leverage this flexibility to evaluate models with and without fire insurance, given its near-perfect correlation with rent prices, to confirm their robustness.
We split the data into training and testing sets, create the models using stepwise AIC, and evaluate their performance using Root Mean Squared Error (RMSE) and R-squared (R2) values. The RMSE measures the average difference between the predicted and actual rent prices, thus the lower the RMSE, the better the model performance, while the R2 value indicates the proportion of the variance in rent prices that is predictable from the independent variables, thus the closer to 1, the better the model fits the data. We create plots to compare the RMSE, R2, and AIC values of the models, as well as histograms and Q-Q plots of the residuals to assess their normality and reliability.
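The workflow above can be sketched with base R only: `step()` performs stepwise selection by AIC on an `lm` fit, and RMSE and R² are computed on the held-out set. The data here are simulated and the helper names `rmse` and `r_squared` are illustrative, so this shows the pattern under stated assumptions and not the report's actual models or numbers.

```r
# Evaluation metrics (hypothetical helper names)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Simulated data: rent depends on fire_insurance and area, not on noise
set.seed(42)
n <- 200
toy <- data.frame(
  area           = runif(n, 30, 300),
  fire_insurance = runif(n, 5, 100),
  noise          = rnorm(n)
)
toy$rent <- 50 + 70 * toy$fire_insurance + 2 * toy$area + rnorm(n, sd = 100)

# 80/20 train/test split
train_idx <- sample(n, size = 0.8 * n)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Stepwise selection by AIC, starting from the full linear model
full_fit <- lm(rent ~ ., data = train)
aic_fit  <- step(full_fit, direction = "both", trace = 0)

# Held-out performance and the AIC of the selected model
pred <- predict(aic_fit, newdata = test)
metrics <- c(RMSE = rmse(test$rent, pred),
             R2   = r_squared(test$rent, pred),
             AIC  = AIC(aic_fit))
metrics
```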
The plots revealed the FE Model is the best model, showing the lowest RMSE and highest R² value, explaining almost 99% of the variability. The interaction terms in this model provided the best predictive performance, indicating that the combined effects of these variables might have slightly reduced noise. The Q-Q plots for this model displayed the most normally distributed residuals, making it again the most reliable model. Additionally, feature engineering showed that the animal feature was not helpful for prediction. In contrast, the No Fire Insurance AIC Model performed poorly, with the highest RMSE and lowest R² values, confirming the crucial importance of fire insurance as a predictor of rent prices. The Top 2 AIC Model, which includes only fire insurance and area, had the second-lowest RMSE and R² values, further confirming that these two variables are significant predictors.
In this part of the code, we first define a function that computes the R-squared value, which measures how well the model’s predictions fit the actual data. A second function encodes categorical variables as binary (one-hot) matrices and combines them with the numeric data. Finally, we split the dataset into training and testing sets based on a given ratio (by default, 80% training and 20% testing).
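A minimal base R sketch of the encoding and splitting helpers described here is shown below. The function names `encode_features` and `split_data`, the `seed` argument, and the toy rows are assumptions made for illustration; the original functions are not shown in the report.

```r
# Hypothetical helper: one-hot encode categorical columns and bind them
# to the remaining (numeric) columns as a single matrix
encode_features <- function(df, cat_cols) {
  dummies <- lapply(cat_cols, function(col) {
    f <- as.factor(df[[col]])
    # One 0/1 column per level of the factor
    m <- outer(as.character(f), levels(f), "==") * 1L
    colnames(m) <- paste(col, levels(f), sep = "_")
    m
  })
  numeric_part <- as.matrix(df[, setdiff(names(df), cat_cols), drop = FALSE])
  cbind(numeric_part, do.call(cbind, dummies))
}

# Hypothetical helper: random train/test split with a given training ratio
split_data <- function(df, train_ratio = 0.8, seed = 123) {
  set.seed(seed)
  idx <- sample(nrow(df), size = floor(train_ratio * nrow(df)))
  list(train = df[idx, , drop = FALSE], test = df[-idx, , drop = FALSE])
}

toy <- data.frame(
  city      = c("Campinas", "Porto Alegre", "Campinas",
                "Rio de Janeiro", "Porto Alegre"),
  furniture = c("furnished", "not furnished", "not furnished",
                "furnished", "not furnished"),
  area      = c(60, 85, 40, 120, 70),
  rent      = c(1500, 2100, 900, 3800, 1700)
)

X     <- encode_features(toy, cat_cols = c("city", "furniture"))
parts <- split_data(toy, train_ratio = 0.8)
colnames(X)
```

Keeping every level as its own column (no reference level dropped) suits penalized and tree-based models; an unpenalized linear model with an intercept would need one level per factor removed to avoid perfect collinearity.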
The Elastic Net model is a type of linear regression that combines two regularization techniques: Lasso (L1) and Ridge (L2). This combination allows the model to perform feature selection and shrinkage, which helps in handling multicollinearity and prevents overfitting. The alpha parameter determines the mix between L1 and L2 penalties (alpha = 0.5 gives an equal mix of Lasso and Ridge). Elastic Net provides accurate predictions by preventing overfitting and capturing the essential patterns in the data. Effective management of the bias-variance tradeoff is critical in real estate forecasting, where overfitting can lead to poor predictions on unseen data.
In real estate data, predictor variables (e.g., area, hoa, property_tax) often exhibit multicollinearity, where some variables are highly correlated. The Elastic Net model mitigates this issue by regularizing the coefficients, thus preventing overfitting. By balancing L1 and L2 penalties, the Elastic Net model provides a robust approach to feature selection and coefficient shrinkage, leading to more stable and interpretable models. This ability to zero out coefficients helps in identifying the most important factors that influence rental prices, leading to simpler and more interpretable models.
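For reference, the penalized objective behind this description can be written out explicitly. The form below follows the parametrization used by the glmnet package; that the report's model was fitted with glmnet is an assumption, since the library is not named in the text. Here $n$ is the number of training houses, $y_i$ the rent, $x_i$ the predictor vector, and $\lambda \ge 0$ the overall penalty strength.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\;
\lambda\left[\,\alpha\,\lVert\beta\rVert_1
+ \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right]
```

With $\alpha = 1$ the penalty reduces to the Lasso, with $\alpha = 0$ to Ridge, and $\alpha = 0.5$ weights the two terms equally. The $\lVert\beta\rVert_1$ term is what can set coefficients exactly to zero.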
A Generalized Additive Model (GAM) is a type of statistical model that extends traditional linear models by allowing non-linear relationships between the predictor variables and the response variable. GAMs achieve this by using smoothing functions, which can model more complex and flexible patterns in the data without assuming a specific parametric form. GAMs offer flexibility in capturing complex relationships without overfitting, which is crucial in a diverse and varied field like real estate.
With Feature Engineering:
By including interactions and smoothing them, the model can capture more
complex relationships. In real estate, relationships between predictors
(e.g., area, number of rooms, HOA fees) and the target variable (rent)
are often non-linear. For example, the effect of area on rent might
increase more steeply for larger apartments but plateau for very large
sizes. The use of smoothing functions provides a more nuanced
understanding of how different factors influence rent, allowing for
better interpretability compared to black-box models. Feature
engineering with interactions and smoothing allows the model to capture
complex patterns that might be missed by simpler models. By including
interaction terms in the GAM, the model can account for these combined
effects, providing a more comprehensive view of the determinants of
rent.
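Written as a formula, a GAM for rent with an identity link takes the additive form below. The two product features match the engineered columns `area_rooms` and `hoa_property_tax` used later in the report; which terms were actually smoothed, and with what basis, is not stated, so the exact set of $f$ terms here is an assumption.

```latex
\mathbb{E}[\text{rent}]
= \beta_0
+ \sum_{j} f_j(x_j)
+ f_{ar}(\text{area}\times\text{rooms})
+ f_{hp}(\text{hoa}\times\text{property\_tax})
```

Each $f$ is a smooth function estimated from the data, so a plateau in the effect of area for very large properties would appear as a flattening of the corresponding curve.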
A Random Forest is an ensemble learning method that constructs multiple decision trees during training and outputs the average prediction of individual trees for regression tasks. It combines the strengths of multiple trees to create a robust model that typically performs better than a single decision tree. Comparing different Random Forest models, including those with and without certain features or with feature engineering, helps in identifying the best approach. Comparative analysis provides a deeper understanding of which model variations and feature sets lead to the most accurate and generalizable predictions.
Random Forests handle correlated predictors better than linear models, as the random feature selection helps in mitigating the multicollinearity issue. Random Forests can capture non-linear relationships and complex interactions between variables that are often present in real estate data. This ability makes Random Forests particularly useful for modeling rental prices, where the relationship between predictors and the target variable is not strictly linear. Random Forests provide insights into the importance of different features by evaluating their impact on the prediction accuracy. Feature importance scores help in identifying key drivers of rental prices, aiding in feature selection and understanding of the underlying factors influencing the market.
The graph consists of two bar plots that display the performance of different models in terms of RMSE (Root Mean Square Error) and R-squared.
Best Performing Models:
- Top RMSE Performers: The models with the lowest RMSE values include
the “Feature Engineering & HT,” “Feature Eng AIC Model,” and “GAM
Splines.” These models are more effective at minimizing prediction
errors, indicating they can predict rental prices more accurately.
- Top R-Squared Performers: The “Feature Eng AIC Model” and “GAM
Splines” also show high R-squared values, indicating that they explain a
significant proportion of the variance in rental prices.
Models that incorporated feature engineering, such as the “Random Forest with Feature Engineering” and “Feature Engineering & HT,” performed well, suggesting that creating interaction terms and other derived features improves the model’s ability to capture complex relationships in the data. In real estate, feature engineering can help to capture nuanced relationships between variables like property size and location, which can significantly impact rental prices.
Comparing Regular and Simplified Models:
Models like the “Top 2 AIC Model” and “No fire_insurance AIC Model” have
higher RMSE and lower R-squared values, indicating that they are less
effective at predicting rental prices. Simpler models or models that
exclude key variables might fail to capture the full complexity of the
rental market, leading to poorer performance.
Effect of Regularization and Tuning:
The “Elastic Net” model, which uses regularization to prevent
overfitting, shows competitive performance. This model balances bias and
variance by combining Lasso and Ridge regression techniques.
Regularization techniques are crucial in preventing overfitting,
especially in real estate data where the predictor variables might be
highly correlated.
Finally, we can say that the rental market is influenced by numerous factors, including location, property characteristics, and economic conditions. Models that capture these complex relationships, like GAMs and Random Forests with feature engineering, are well-suited for predicting rental prices. Using advanced modeling techniques allows us to account for the complexity and variability in real estate data, leading to more accurate and reliable predictions. Choosing the right model based on performance metrics ensures that we have a reliable tool for predicting rental prices and understanding market dynamics.
The graph represents the performance of various models in terms of RMSE (Root Mean Square Error) and R-squared values. Models are sorted by their R-squared values in descending order, allowing a clear comparison of how well each model explains the variability in the data. This comparative analysis helps identify the strengths and weaknesses of different modeling approaches and guides future model improvements by focusing on techniques that have shown better performance.
| | Model | RMSE | R2 |
|---|---|---|---|
| 7 | Random Forest with Feature Engineering & HT | 226.0638 | 0.9943584 |
| 3 | GAM with Feature Engineering | 295.0359 | 0.9900848 |
| 2 | GAM Splines | 302.7653 | 0.9895709 |
| 1 | Elastic Net | 336.1832 | 0.9871174 |
| 4 | Random Forest | 376.1842 | 0.9861984 |
| 6 | Random Forest with Feature Engineering | 435.1063 | 0.9815419 |
| 5 | Random Forest No Fire Insurance | 1726.2206 | 0.6625408 |
Trend in RMSE Values:
- The RMSE values range from approximately 226 to 1938.
- Lower RMSE values are observed in models like “Random Forest with
Feature Engineering & HT” and “GAM with Feature Engineering,”
indicating better predictive accuracy.
- Higher RMSE values are seen in models like “Random Forest No Fire
Insurance” and “No fire_insurance AIC Model,” which suggests poorer
predictive accuracy.
Trend in R² Values:
- The R² values are high for most models, ranging from 0.572 to 0.994.
Higher R² values indicate that these models explain a large portion of
the variance in the data.
- Models like “Random Forest with Feature Engineering & HT” and
“GAM with Feature Engineering” have R² values close to 0.99, indicating
excellent performance.
- Models like “No fire_insurance AIC Model,” with an R² of 0.572, explain
less variance and are less reliable in predicting outcomes.
Patterns and insights:
- Models involving “Feature Engineering” generally perform well, with
low RMSE and high R² values, suggesting that including feature
engineering improves model performance.
- “Random Forest” models show a significant range in performance. While
“Random Forest with Feature Engineering & HT” has the lowest RMSE and
highest R² values, “Random Forest No Fire Insurance” has a much higher
RMSE and lower R², indicating that important variables like
fire_insurance have a significant impact on model accuracy, and their
exclusion can lead to poorer predictive performance.
- Models labeled with “AIC Model” show varied performance. The “Top 2
AIC Model” performs relatively well, but the “No fire_insurance AIC
Model” has the highest RMSE value, indicating that in real
estate, simplistic models might miss critical factors affecting property
values, resulting in less accurate predictions.
- Geographical Factors: Segmenting the rental market by geographical
location can yield valuable insights. Models with detailed geographical
data (such as “Random Forest with Feature Engineering & HT”)
performed well, suggesting that location-specific features are
significant predictors of rental prices.
In the end, then, we can say that:
Best Performing Model: “Random Forest with Feature Engineering &
HT” stands out with the lowest RMSE (226.064) and the highest R²
(0.994). This model is exceptionally effective in minimizing prediction
error and explaining variance.
Worst Performing Model: “No fire_insurance AIC Model” has the highest
RMSE (1938.126) and the lowest R² value (0.572). This indicates
poor predictive performance and a lack of fit to the data.
The analysis suggests a strong emphasis on detailed, data-driven
approaches to understand rental price dynamics and segment the market
effectively. The focus should then be on geographical segmentation,
economic and demographic factors, and property features to guide
investment decisions. By leveraging advanced models and comprehensive
data, the company can better predict high rental revenue areas and
optimize its real estate investment strategy.
We use the first Rio de Janeiro house in the dataset to predict the rent using different models and compare the predictions to the actual rent amount. In doing so, we demonstrate how the models perform on unseen data. The house presents the following characteristics:
| city | area | rooms | bathroom | parking.spaces | floor | animal | furniture | hoa | property_tax | fire_insurance | area_rooms | hoa_property_tax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rio de Janeiro | 72 | 2 | 1 | 0 | 7 | acept | not furnished | 740 | 85 | 25 | 144 | 62900 |
In particular, it has a rent amount of 1900.
| Model | Predicted_Rent |
|---|---|
| GAM | 2009.351 |
| GAM with FE | 1979.483 |
| Elastic Net | 1995.501 |
| Random Forest | 1873.851 |
| RF No Fire | 1742.200 |
| RF with FE | 1880.802 |
| RF with FE & HT | 1897.092 |
| Actual Rent | 1900.000 |
Model Complexity and Generalization:
- Hyperparameter Tuning: The addition of hyperparameter tuning (RF with
FE & HT) improved the Random Forest model’s accuracy, suggesting
that fine-tuning model parameters can lead to better generalization and
prediction performance.
- Elastic Net Regularization: The Elastic Net model, which balances
between L1 and L2 regularization, provided a good estimate but slightly
overestimated the rent (1995.501 against an actual 1900). A single
prediction cannot settle the matter, but it is a reminder that while
regularization helps in preventing overfitting, it may also lead to
underfitting in some cases.
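The size of each model's miss on this house can be recomputed directly from the prediction table above. The sketch below uses only the reported predictions and the actual rent of 1900; the variable names are illustrative.

```r
actual_rent <- 1900

# Predictions copied from the table above
predicted <- c(
  "GAM"             = 2009.351,
  "GAM with FE"     = 1979.483,
  "Elastic Net"     = 1995.501,
  "Random Forest"   = 1873.851,
  "RF No Fire"      = 1742.200,
  "RF with FE"      = 1880.802,
  "RF with FE & HT" = 1897.092
)

# Absolute and percentage errors, smallest first
abs_error <- abs(predicted - actual_rent)
pct_error <- 100 * abs_error / actual_rent
sort(round(abs_error, 3))
```

On this single house the tuned Random Forest is closest and the model without fire insurance is furthest, which matches the ranking by test-set RMSE; one observation is of course only an illustration, not an evaluation.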
The analysis highlights the importance of using advanced modeling techniques like Random Forests with feature engineering and hyperparameter tuning for accurate rent prediction in real estate. These models outperform simpler models by effectively capturing the complex relationships between various property features and their impact on rent. The results also underscore the value of comprehensive and detailed data, which is essential for making informed decisions in the real estate market. By leveraging these insights, real estate professionals can better understand market dynamics, set competitive rental prices, and make data-driven investment decisions.
ELBOW METHOD:
The elbow plot shows the total within-cluster sum of squares (WSS)
against the number of clusters (k). As k increases, the WSS decreases.
This is expected because adding more clusters generally reduces the
distance between points and their assigned cluster centers. The “elbow”
point, where the rate of decrease in WSS sharply slows down, is around k
= 3. This suggests that adding more clusters beyond 3 doesn’t
significantly improve the compactness of the clusters.
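The elbow computation can be sketched with `kmeans()` from base R. The data below are simulated as three well-separated blobs, and the helper name `elbow_wss` is hypothetical; the sketch only illustrates how the WSS curve is obtained, not the report's actual features.

```r
# Hypothetical helper: total within-cluster sum of squares for k = 1..k_max
elbow_wss <- function(X, k_max = 6, seed = 123) {
  set.seed(seed)
  sapply(seq_len(k_max), function(k) {
    # nstart > 1 reruns k-means from several random starts and keeps the best
    kmeans(X, centers = k, nstart = 10)$tot.withinss
  })
}

# Simulated data: three groups of 30 points around (0,0), (5,5), (10,10)
set.seed(1)
toy <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2)
)

wss <- elbow_wss(toy, k_max = 6)
round(wss, 1)
```

Plotting `wss` against `1:6` gives the elbow plot; for data like this the curve drops steeply up to k = 3 and flattens afterwards.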
SILHOUETTE METHOD:
The silhouette plot shows the average silhouette width for different
numbers of clusters. The silhouette width measures how similar a point
is to its own cluster compared to other clusters. A higher silhouette
width indicates better-defined clusters. The highest average silhouette
width is observed at k = 2. This suggests that the data forms
well-defined clusters when divided into two groups. However, the
silhouette width for k = 3 is also relatively high, indicating that
three clusters might also be a reasonable choice.
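To make the silhouette definition concrete, the sketch below computes the average silhouette width by hand in base R: for each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to any other cluster, and the width is `(b - a) / max(a, b)`. The helper `avg_silhouette` and the eight toy points are hypothetical; in practice a packaged implementation would normally be used.

```r
# Hypothetical helper: average silhouette width for given cluster labels
avg_silhouette <- function(X, labels) {
  D <- as.matrix(dist(X))          # pairwise Euclidean distances
  n <- nrow(D)
  s <- numeric(n)
  for (i in seq_len(n)) {
    own <- labels == labels[i]
    own[i] <- FALSE                # exclude the point itself
    if (!any(own)) next            # singleton cluster: width stays 0
    a <- mean(D[i, own])
    others <- setdiff(unique(labels), labels[i])
    b <- min(sapply(others, function(cl) mean(D[i, labels == cl])))
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

# Two tight, well-separated groups of four points each
toy <- rbind(
  cbind(c(0, 0, 1, 1),     c(0, 1, 0, 1)),
  cbind(c(10, 10, 11, 11), c(0, 1, 0, 1))
)
good_labels <- c(1, 1, 1, 1, 2, 2, 2, 2)
bad_labels  <- c(1, 2, 1, 2, 1, 2, 1, 2)

sil_good <- avg_silhouette(toy, good_labels)
sil_bad  <- avg_silhouette(toy, bad_labels)
```

Evaluating this on the k-means labels for each candidate k and picking the largest average is the silhouette method described above.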
Real Estate Context Application:
- k = 2: The properties might be clustered into two broad categories,
such as luxury vs. budget properties.
- k = 3: The clusters might represent three segments, such as luxury,
mid-range, and budget properties. Understanding these clusters can help
in setting competitive prices tailored to each segment. For example,
properties in the luxury cluster can be priced higher due to their
premium features. By clustering the properties, real estate analysts can
better understand the market dynamics and the distribution of property
types, helping in targeted marketing and sales strategies.
k = 2 Clusters:
The two clusters are well-separated, indicating distinct
groupings in the data. Cluster 1 (red) appears to have higher values
along the second dimension (Dim2), suggesting it might represent
properties with higher values for certain features. Cluster 2 (blue)
likely represents the remaining properties, which could be more common
or lower-value. In a real estate context, these two clusters might
reflect a broad segmentation of the market into high-value
and standard properties.
k = 3 Clusters:
The addition of a third cluster introduces another grouping (green),
which provides more granularity. The clusters are fairly well-separated,
with some overlap between Cluster 2 and Cluster 3. Cluster 1 (red)
remains similar, representing properties with distinct characteristics
or higher values. Cluster 2 (green) and Cluster 3 (blue) could represent
a further subdivision of the more common properties, perhaps splitting
them into mid-range and budget segments. This can help in more refined
market segmentation, allowing for more precise targeting of customer
segments and better-informed investment decisions.
Comparison between the two clusterings:
With k=2, the division is simpler and might be easier to interpret for
broad market analysis. With k=3, the segmentation is more detailed,
providing deeper insights into the property market. The analysis of k=2
and k=3 clustering shows that both provide valuable insights into the
real estate market. The choice between them depends on the level of
detail required. For high-level market segmentation, k=2 is sufficient,
while k=3 offers more granular insights that can guide detailed pricing
and marketing strategies. Both approaches are useful in identifying
property segments that can be targeted for specific strategies,
improving decision-making in the real estate industry.
| Var1 | Freq |
|---|---|
| 1 | 7114 |
| 2 | 2947 |
| Var1 | Freq |
|---|---|
| 1 | 5468 |
| 2 | 2985 |
| 3 | 1608 |
We determined the number of houses in each cluster after applying k-Means clustering with different numbers of clusters (k).
For k = 2 Clusters: Cluster 1: 7,114 houses - Cluster 2: 2,947 houses
For k = 3 Clusters: Cluster 1: 5,468 houses - Cluster 2: 2,985 houses - Cluster 3: 1,608 houses
As the number of clusters increases, the houses are distributed across more groups, leading to a more granular segmentation of the dataset. With fewer clusters, each cluster contains a large number of houses and represents a broader market segment. The cluster sizes vary significantly, indicating that some segments of the market are much larger than others. This is typical in real estate, where certain types of properties might dominate the market.
Understanding the distribution of houses across clusters supports three kinds of decisions:
- Marketing: real estate professionals can develop targeted campaigns for different property segments.
- Pricing: identifying clusters helps in setting appropriate prices for each segment. For example, properties in smaller clusters might command higher prices if they are identified as luxury or unique segments.
- Investment: clusters with fewer houses might represent niche markets with high potential for growth, guiding investment strategies.
Overall, the clustering results give a clear view of the segmentation within the real estate market, helping to identify both broad and niche segments and to tailor strategies to market demands.
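The cluster sizes are easier to compare as shares of the whole dataset. The sketch below is written in Python purely for illustration (it is not the code behind this report) and converts the counts from the two frequency tables into percentages. The helper name `shares` is hypothetical.

```python
# Cluster sizes as reported in the two frequency tables.
sizes_k2 = {1: 7114, 2: 2947}
sizes_k3 = {1: 5468, 2: 2985, 3: 1608}


def shares(sizes):
    """Return each cluster's share of all houses, in percent (one decimal)."""
    total = sum(sizes.values())
    return {cluster: round(100 * n / total, 1) for cluster, n in sizes.items()}


# Both partitions cover the same set of houses.
total_k2 = sum(sizes_k2.values())
total_k3 = sum(sizes_k3.values())

print(total_k2, total_k3)
print(shares(sizes_k2))
print(shares(sizes_k3))
```

Both partitions sum to 10,061 houses, and with k = 3 the smallest cluster holds roughly 16% of them.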
We also performed hierarchical clustering on the dataset and visualized the clusters for k = 2 and k = 3.
Hierarchical Clusters (k = 2):
Two distinct clusters are visible. One cluster is relatively
small and compact (red), while the other is larger and more spread out
(blue). The red cluster could represent a specific type of property,
such as high-end or unique properties. The blue cluster likely
represents a broader category, such as standard properties. This mirrors
the pattern observed in the k-Means analysis above, and such a split is
common in real estate markets.
Hierarchical Clusters (k = 3):
With three clusters, the data is divided further, adding a new cluster
(green) that represents a more granular division of the data. The three
clusters likely represent different property segments, such as budget,
mid-range, and luxury properties. The green cluster might indicate
mid-range properties, providing a more detailed market segmentation.
These interpretations are similar to those of the k-Means model, showing
consistency between the two methods. Real estate companies can allocate
resources more effectively by focusing on dominant clusters for general
strategies and dedicating specialized efforts to unique or smaller
market segments.
k-Means Clustering (k = 2):
The k-Means algorithm divides the data into two clusters.
- Cluster 1 (red) is compact and concentrated at lower values along the
first dimension (Dim1).
- Cluster 2 (blue) is larger and more spread out, encompassing a broader
range of values along both dimensions.
The boundary between the clusters is defined by the nearest means, which
is typical for k-Means clustering.
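The nearest-mean rule can be illustrated with a minimal sketch of Lloyd's algorithm, the standard k-Means iteration. This is not the code used for the report: the function name `kmeans`, the toy points, and the choice of starting centroids are all assumptions made for illustration.

```python
import math


def kmeans(points, centroids, n_iter=20):
    """Alternate nearest-centroid assignment and mean update (Lloyd's algorithm)."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids


# Hypothetical 2-D points standing in for houses projected on (Dim1, Dim2).
toy_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
toy_labels, toy_centroids = kmeans(toy_points, [toy_points[0], toy_points[3]])
print(toy_labels)
print(toy_centroids)
```

Because every point is assigned to its nearest centroid, the boundary between two clusters is the set of locations equidistant from the two means, which is why k-Means boundaries look like straight cuts in the plot.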
Hierarchical Clustering (k = 2):
Hierarchical clustering also divides the data into two clusters. The
clusters are similarly defined, with Cluster 1 (red) being compact and
Cluster 2 (blue) being more spread out. However, the boundaries between
clusters can appear different compared to k-Means because hierarchical
clustering uses a different approach to forming clusters, based on
distance measurements and a hierarchical merging process.
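The merging process can be sketched as a small agglomerative procedure. The report does not state which linkage was used, so complete linkage is assumed here only for illustration. The function name `agglomerative` and the toy points are hypothetical.

```python
import math


def agglomerative(points, k):
    """Repeatedly merge the two closest clusters until only k remain."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point

    def linkage(a, b):
        # Complete linkage (an assumption): distance between the farthest members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            (
                (x, y)
                for x in range(len(clusters))
                for y in range(x + 1, len(clusters))
            ),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical 2-D points standing in for houses projected on (Dim1, Dim2).
toy_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
two_groups = agglomerative(toy_points, 2)
three_groups = agglomerative(toy_points, 3)
print(two_groups)
print(three_groups)
```

Cutting the same merge sequence at k = 2 or k = 3 gives nested partitions, which is why the three-cluster hierarchical solution is a refinement of the two-cluster one.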
Comparisons:
Both clustering methods result in two distinct clusters, but the shapes
and boundaries differ.
- k-Means forms clusters by partitioning the space into cells based on
the mean positions, leading to more compact, roughly circular cluster
shapes.
- Hierarchical clustering can form clusters with more irregular shapes,
as it merges points based on distance, which might better capture the
natural structure of the data.
k-Means Clustering (k = 3):
- Cluster 1 (Red): Compact cluster with lower values in both dimensions,
indicating properties with smaller size and fewer features.
- Cluster 2 (Blue): Largest and most spread-out cluster, containing
properties with a wide range of values, possibly representing mid-range
properties.
- Cluster 3 (Green): Smaller cluster with a narrow range along the first
dimension but spread along the second dimension, possibly representing a
specific type of property with unique features.
Hierarchical Clustering (k = 3):
- Cluster 1 (Red): Similar to k-Means, this cluster is compact and
represents properties with lower values in both dimensions.
- Cluster 2 (Blue): More irregularly shaped than in k-Means, but still
the largest cluster, indicating a diverse range of properties.
- Cluster 3 (Green): This cluster has a more distinct shape compared to
k-Means, showing that hierarchical clustering can better capture
non-spherical distributions and might represent properties with specific
characteristics.
Comparisons:
In k-Means clustering, clusters tend to be more spherical or elliptical.
The boundaries are influenced by the centroid positions, leading to
simpler shapes. With hierarchical clustering, clusters can take on more
complex, irregular shapes, capturing the natural structure of the data
more flexibly.
For this project, k-Means is the better choice, for three reasons:
- Large dataset: k-Means is efficient and quick on large datasets, which
is our case.
- Number of clusters known: we have a good idea of the number of clusters
needed, namely three, matching the budget, mid-range, and luxury market
segmentation.
- Speed and simplicity: it offers a simple, fast solution for initial
segmentation, which is useful when computational resources are limited.
k-Means is therefore well suited to large real estate datasets where
quick segmentation is needed to identify broad market segments, helping
to set pricing strategies and marketing campaigns efficiently.
| cluster_kmeans | area | rooms | bathroom | parking.spaces | hoa | property_tax | fire_insurance | count |
|---|---|---|---|---|---|---|---|---|
| 1 | 66.17648 | 1.691478 | 1.004938 | 0.8641185 | 543.8307 | 90.91203 | 28.25055 | 5468 |
| 2 | 171.45729 | 3.355109 | 1.821441 | 1.7018425 | 812.6265 | 276.94640 | 56.72864 | 2985 |
| 3 | 315.57836 | 3.685323 | 1.204602 | 2.1007463 | 2265.2425 | 1109.47015 | 112.86940 | 1608 |
The table summarizes the characteristics of three clusters identified through k-Means clustering. Each cluster represents a distinct segment of rental properties with specific attributes.
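A profile table of this kind is a per-cluster mean of each feature plus a count. The sketch below shows the computation in Python on a handful of made-up rows. The rows, the `cluster` key, and the helper `cluster_profile` are hypothetical and do not reproduce the report's data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: each house carries its cluster label and two features.
rows = [
    {"cluster": 1, "area": 60, "rooms": 2},
    {"cluster": 1, "area": 70, "rooms": 1},
    {"cluster": 2, "area": 170, "rooms": 3},
    {"cluster": 2, "area": 180, "rooms": 4},
]


def cluster_profile(rows, features):
    """Return {cluster: {feature: mean, ..., 'count': n}} for the given features."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["cluster"]].append(row)
    profile = {}
    for cluster, members in sorted(groups.items()):
        summary = {f: mean(r[f] for r in members) for f in features}
        summary["count"] = len(members)
        profile[cluster] = summary
    return profile


profile = cluster_profile(rows, ["area", "rooms"])
for cluster, summary in profile.items():
    print(cluster, summary)
```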
Cluster 1:
- Low-Cost Housing: Cluster 1 represents smaller, budget properties with
fewer amenities and lower maintenance costs.
- Affordable Living: The lower HOA fees, property taxes, and insurance
costs indicate that these properties are more affordable, likely
appealing to low to middle-income tenants.
- Rental Market: This segment likely dominates the market due to its
large size, indicating a high demand for budget-friendly rentals.
- Investment Potential: These properties may offer stable rental income
with lower entry costs, making them attractive to investors looking for
steady cash flow.
Cluster 2:
- Balanced Offerings: Cluster 2 represents mid-range properties with
moderate sizes and amenities, appealing to middle-class tenants.
- Moderate Costs: The HOA fees, property taxes, and insurance costs are
moderate, reflecting a balance between affordability and quality.
- Growing Segment: This segment is substantial in size, indicating a
significant portion of the rental market, catering to tenants looking
for a balance between cost and comfort.
- Investment Perspective: Mid-range properties offer a good compromise
between cost and return, making them an attractive option for investors
seeking balanced portfolios. Investors can diversify their portfolios by
targeting different clusters. Budget properties offer stability and
lower risk, luxury properties promise high returns, and mid-range
properties provide a balanced investment option with moderate returns
and risk. Policymakers and developers can use these insights to address
housing needs.
Cluster 3:
- High-End Segment: Cluster 3 represents luxury properties with large
areas and more rooms, appealing to high-income tenants.
- High Maintenance: The significantly higher HOA fees, property taxes,
and insurance costs reflect the premium nature of these properties,
often located in desirable neighborhoods with upscale amenities.
- Exclusive Market: This cluster is smaller, indicating a niche market
segment. These properties are likely to command high rental
prices.
- Investment Opportunity: Despite higher costs, luxury properties can
offer substantial returns through high rental income and potential for
capital appreciation.
Emphasizing affordable housing development can cater to the largest market segment, while also creating opportunities for luxury and mid-range developments to meet diverse demands.
## Average Silhouette Width: 0.2743028
The silhouette width is a measure of how similar an object is to its own cluster compared to other clusters. It ranges from -1 to 1:
- A value near 1 indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
- A value near 0 indicates that the object is on or very close to the decision boundary between two neighboring clusters.
- A value near -1 indicates that the object is probably misclassified and is actually closer to a neighboring cluster than to the cluster it is assigned to.
An average silhouette width of approximately 0.27 suggests a moderate clustering structure: some points are well clustered, while others are not as clearly defined.
- The clusters are reasonably well formed, but there is room for improvement. They are not tightly defined, and there may be overlap between them.
- The clusters have some internal cohesion but are not entirely separated from each other. Some data points fit well within their assigned clusters, while others lie close to the boundaries between clusters.
- In the context of real estate data, a moderate silhouette width might indicate that the property characteristics used for clustering do not perfectly segment the market. There could be overlap in characteristics among different property types or areas, suggesting a diverse and interconnected market.
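For a single point, the silhouette width is s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below computes it on four made-up points. The function name `silhouette_widths` and the data are hypothetical, and the code assumes at least two clusters.

```python
import math


def silhouette_widths(points, labels):
    """Return the silhouette width of every point (requires at least 2 clusters)."""
    widths = []
    for i, p in enumerate(points):
        # a: mean distance to the other members of the point's own cluster.
        own = [
            math.dist(p, q)
            for j, q in enumerate(points)
            if labels[j] == labels[i] and j != i
        ]
        if not own:
            widths.append(0.0)  # usual convention for a singleton cluster
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels)
            if other != labels[i]
        )
        widths.append((b - a) / max(a, b))
    return widths


# Two tight, well-separated groups (hypothetical data).
toy_points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
good = silhouette_widths(toy_points, [0, 0, 1, 1])
bad = silhouette_widths(toy_points, [0, 1, 0, 1])  # deliberately poor labelling
avg_good = sum(good) / len(good)
avg_bad = sum(bad) / len(bad)
print(round(avg_good, 3), round(avg_bad, 3))
```

The well-separated labelling scores far higher than the deliberately poor one, which comes out negative. An average near 0.27 sits between these extremes.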
| cluster_kmeans | mean_rent |
|---|---|
| 1 | 2125.167 |
| 2 | 4027.109 |
| 3 | 8293.474 |
The table shows the mean rent for each cluster identified through k-Means clustering. This information is crucial for understanding the rental market segmentation and making informed decisions regarding pricing, marketing, and investment.
- Cluster 1, Budget Properties: This cluster has the lowest mean rent. The low value suggests budget-friendly properties that are typically smaller, have fewer amenities, and are possibly located in less prime areas.
- Cluster 2, Mid-Range Properties: The mean rent for this cluster falls between the budget and luxury segments. These properties are likely mid-sized with moderate amenities, making them suitable for middle-income tenants seeking a balance between cost and quality. This mid-range segment caters to tenants looking for better living standards without the premium costs.
- Cluster 3, Luxury Properties: This cluster has the highest mean rent, indicating that it comprises luxury properties with larger areas, more rooms, and additional amenities such as parking spaces. These properties are likely located in premium areas with higher HOA fees, property taxes, and insurance costs. This cluster appeals to high-income tenants looking for premium housing options.
The boxplot illustrates the distribution of rental prices across three clusters. Each cluster represents a distinct segment of the rental market, identified through k-Means clustering.
Cluster 1: Budget Properties
- Median Rent: The median rent for Cluster 1 is the lowest among the
three clusters.
- IQR: The rent distribution is narrow, indicating low variability in
rental prices within this cluster.
- Outliers: There are some high outliers, suggesting that while most
properties are budget-friendly, a few might be priced higher due to
unique features or locations.
Cluster 2: Mid-Range Properties
- Median Rent: The median rent for Cluster 2 is moderate, falling
between Clusters 1 and 3.
- IQR: The rent distribution is also moderate, indicating a balance in
rental prices and variability.
- Outliers: There are fewer high outliers compared to Cluster 3,
suggesting that mid-range properties maintain more consistent rental
prices.
Cluster 3: Luxury Properties
- Median Rent: Cluster 3 has the highest median rent, indicating it
represents the luxury segment.
- IQR: The rent distribution is broad, showing significant variability
in rental prices. This suggests a range of luxury properties with
diverse amenities and features.
- Outliers: Numerous high outliers indicate properties with
exceptionally high rents, likely due to prime locations or exclusive
features.
The boxplot provides valuable insights into the distribution of rental prices across different market segments. Each cluster represents a distinct profile, offering opportunities for targeted investment, pricing, and marketing strategies in the real estate market. Understanding these clusters helps in making informed decisions that align with the diverse needs of tenants and investors.
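The median, IQR, and outliers read off a boxplot can also be computed directly. The sketch below uses the common 1.5 × IQR rule to flag outliers on a made-up list of rents. The values and the helper `box_summary` are hypothetical, and quartile conventions vary between tools, so figures may differ slightly from those drawn by a plotting library.

```python
from statistics import median, quantiles


def box_summary(values):
    """Return the median, IQR and 1.5*IQR outliers of a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return {
        "median": median(values),
        "iqr": iqr,
        "outliers": [v for v in values if v < lower_fence or v > upper_fence],
    }


# Hypothetical rents for one cluster, including one unusually expensive house.
toy_rents = [1500, 1800, 2000, 2100, 2300, 2500, 9000]
summary = box_summary(toy_rents)
print(summary)
```

A narrow IQR with a few points above the upper fence is the pattern described for Cluster 1: mostly similar rents, plus a handful of unusually expensive properties.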
ANOVA ANALYSIS:
## [1] "ANOVA Results for k-Means Clusters:"
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(cluster_kmeans) 2 4.780e+10 2.390e+10 6255 <2e-16 ***
## Residuals 10058 3.843e+10 3.821e+06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "ANOVA Results for Hierarchical Clusters:"
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(cluster_hierarchical) 2 3.488e+10 1.744e+10 3416 <2e-16 ***
## Residuals 10058 5.135e+10 5.106e+06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA (Analysis of Variance) results test the differences in rental prices across clusters identified by the hierarchical and k-Means clustering methods.
Cluster Comparison Effectiveness:
- k-Means Clustering: The higher F value (6255 vs. 3416) and between-cluster sum of squares (4.780e+10 vs. 3.488e+10) indicate that this method provides a more distinct separation of rental prices among clusters. This suggests that k-Means clustering is more effective in segmenting the rental market by distinguishing between properties with different rental values. The higher mean square value further emphasizes that k-Means clusters capture a larger portion of the variability in rental prices, making it more suitable for understanding the rental market dynamics.
- Hierarchical Clustering: Hierarchical clustering also shows significant differences in rental prices between clusters, but to a lesser extent than k-Means. It suggests that while hierarchical clustering provides a meaningful segmentation, it might not capture the rental price variations as distinctly as k-Means.
Market Segmentation:
- k-Means Clustering: More effective in identifying distinct market segments in the rental market, which is critical for targeted marketing, pricing, and investment strategies. It can be used to clearly identify luxury, mid-range, and budget segments, enabling better decision-making and strategy formulation.
- Hierarchical Clustering: Useful for exploratory analysis and understanding the hierarchical structure of the rental market. It helps in identifying subtle differences in the market that may not be as pronounced but are still relevant for a more granular analysis.
Pricing and Investment Strategies:
- k-Means Clustering: The clear separation of clusters suggests that rental properties can be priced and marketed according to well-defined segments. Luxury properties can command higher rents, while budget properties can be offered at competitive prices to attract more tenants. Investors can leverage the distinct clusters to focus on specific market segments with clear expectations of rental income and market positioning.
- Hierarchical Clustering: Provides insights into the gradation of the rental market, which can help in pricing properties along a continuum rather than in distinct steps. It is useful for identifying niche investment opportunities where properties straddle two segments, offering potential for higher returns through targeted improvements or repositioning.
Both k-Means and hierarchical clustering show significant differences in rental prices across clusters, with k-Means providing more distinct and effective segmentation. This suggests that k-Means clustering might be more suitable for applications requiring clear market segmentation, such as pricing strategies and targeted investments.
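The comparison can be made concrete by recomputing the F statistic and the share of rent variance explained by the clusters (eta squared, i.e. the between-cluster sum of squares divided by the total) from the two ANOVA tables. The sketch below uses only the rounded figures printed above, so the results are approximate. The helper name `anova_summary` is hypothetical.

```python
def anova_summary(ss_between, ss_within, df_between, df_within):
    """Recompute F and eta squared from the sums of squares of a one-way ANOVA."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "F": ms_between / ms_within,
        "eta_sq": ss_between / (ss_between + ss_within),
    }


# Rounded values copied from the ANOVA output for each clustering method.
kmeans_anova = anova_summary(4.780e10, 3.843e10, 2, 10058)
hier_anova = anova_summary(3.488e10, 5.135e10, 2, 10058)

print(round(kmeans_anova["F"]), round(kmeans_anova["eta_sq"], 3))
print(round(hier_anova["F"]), round(hier_anova["eta_sq"], 3))
```

On these figures the k-Means clusters account for roughly 55% of the variance in rent, against roughly 40% for the hierarchical clusters, which is consistent with the ranking given by the F values.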
COMPARISONS BY RENT AND CITY:
The boxplot visualizes the distribution of rental prices across different cities and clusters.
Rental Price Variation by Cluster:
- Cluster 1 (Magenta): Generally shows lower rental prices across all
cities. This cluster likely represents lower-value or budget
properties.
- Cluster 2 (Blue): Displays mid-range rental prices, suggesting
mid-range properties.
- Cluster 3 (Green): Shows the highest rental prices, indicating that
this cluster likely contains high-value or luxury properties.
City Insights:
- Belo Horizonte: Rental prices vary significantly across clusters.
Cluster 3 (green) has the highest median rental prices, indicating a
presence of luxury properties. Cluster 1 (magenta) has the lowest
prices, representing budget properties.
- Campinas: Similar trends are observed, with Cluster 3 (green) showing
high rental prices, while Cluster 1 (magenta) indicates budget
properties. Cluster 2 (blue) represents a middle ground.
- Porto Alegre: Also shows a distinct separation of rental prices by
clusters, with Cluster 3 (green) having the highest prices.
- Rio de Janeiro: Cluster 3 (green) continues to dominate with high
rental prices, indicating a significant luxury property segment.
- São Paulo: Displays a broad range of rental prices within clusters.
Cluster 3 (green) again has the highest prices, highlighting the
presence of premium properties.
The distinct separation of rental prices by cluster within each city highlights the effectiveness of the clustering approach in identifying different market segments. This is crucial for targeted marketing and pricing strategies. Understanding the rental price distribution across clusters helps in setting competitive rental prices. For example, properties in Cluster 3 (green) can be positioned as premium rentals with higher price points. The visualization helps identify high-value clusters across cities, guiding investors towards segments with higher rental yields. Conversely, it highlights budget property clusters, which might offer opportunities for value investments or developments. The comparison across cities shows that luxury properties (Cluster 3) command high rental prices in all cities. This information is useful for understanding regional variations in rental markets and making informed investment decisions.